The goal of this project is to investigate how partnerships involving multiple top-tier players in the NBA affect various performance measures and team outcomes. Among the research questions we would like to explore are the following:

[ INSERT RESEARCH QUESTIONS HERE ]

To investigate these questions, we need to pull data from multiple NBA seasons. The script below defines functions that pull traditional stats for every player in a user-specified season.

Load necessary libraries

library(rvest)
library(dplyr)
library(tidyverse)
library(httr)

Function to get NBA roster for a specified year

get_nba_roster <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that are repeated inside the table body)
  roster_table <- roster_table %>%
    filter(Player != "Player")

  return(roster_table)
}

Example usage

year <- 2018  # Specify the year
nba_roster <- get_nba_roster(year)

# Print the first and last few rows of the roster
head(nba_roster)
tail(nba_roster)

# Keep every player whose listed position is not point guard
position_roster <- filter(nba_roster, Pos != "PG")
position_roster

library(plotly)

# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]

# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
               text = ~Player,  # This adds player names on hover
               hoverinfo = 'text', # Ensures that only player names appear on hover
               color = ~Pos,  # Colors points based on position
               marker = list(size = 10))

# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
                      xaxis = list(title = "Minutes Played", range = c(0, 48)),
                      yaxis = list(title = "Points", range = c(0, 35)))

# Show the plot
fig
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
# Data visualization for minutes played vs. points scored
input <- nba_roster[, c('MP', 'PTS')]
print(head(input))

# Plot players with minutes played between 0 and 48
# and points between 0 and 35.
plot(x = input$MP, y = input$PTS,
    xlab = "Minutes Played",
    ylab = "Points",
    xlim = c(0.0, 48),
    ylim = c(0.0, 35),   
    main = "Minutes Played vs Points Scored"
)

# Data visualization for field goals attempted vs. field goals made
input_2 <- nba_roster[, c('FGA', 'FG')]
print(head(input_2))

# Use the observed maxima as the upper axis limits.
# as.numeric() guards against columns that were scraped as character.
b_FG <- max(as.numeric(input_2$FG), na.rm = TRUE)
b_FGA <- max(as.numeric(input_2$FGA), na.rm = TRUE)

plot(x = as.numeric(input_2$FGA), y = as.numeric(input_2$FG),
    xlab = "Field Goal Attempts",
    ylab = "Field Goals Made",
    xlim = c(0.0, b_FGA),
    ylim = c(0.0, b_FG),
    main = "Field Goal Attempts vs Field Goals Made"
)

# Bar chart of the number of players listed at each position.
# geom_bar() counts rows, so the y-axis shows player counts rather than points.
ggplot(nba_roster, aes(x = Pos)) +
  geom_bar() +
  theme_classic(16) +
  xlab("Position") +
  ylab("Number of players")

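ASSIGNMENT 1: Is the data “clean”? Are there any missing values to be accounted for/addressed? If there are any data quality issues,

a. propose a method to resolve them

The stat columns are read in as character even though they hold fractional values, so they should be converted from “character” to “double”.

b. justify the validity of your approach

Observations with missing data are removed with na.omit(), which drops every row that contains a missing value.

c. implement your proposed changes

For players who are missing data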

nba_roster <- na.omit(nba_roster)

nba_roster

# Convert the stat columns (G through PTS) from character to double
nba_roster <- nba_roster %>%
   mutate(across(G:PTS, as.numeric))

To determine whether a player is “top tier” and should be considered a part of a “Big 3” lineup, other authors have transformed traditional stats to create metrics such as

PRA = POINTS + REBOUNDS + ASSISTS

We will also consider advanced statistics such as PLAYER EFFICIENCY RATING (PER). The formula below is a simplified per-game efficiency measure, not the full PER calculation:

PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO) / GP

In particular, Value Over Replacement Player (VORP) seems to do a solid job of identifying the best players in the league.

The script below defines a function that pulls advanced stats for every player in a user-specified season.


# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2018  # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats)

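ASSIGNMENT 2: Is the advanced data “clean”? Are there any missing values to be accounted for/addressed? If there are any data quality issues,

a. propose a method to resolve them

b. justify the validity of your approach

c. implement your proposed changes

The cleaning is similar to the approach used for the traditional stats.

The script below cleans up the data quality issues present in the data frame.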

# Order the rows alphabetically by player name so the repeated header rows
# (where Player == "Player") end up next to each other
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]



# Remove the filler rows that repeat the column headers from the webpage.
# Matching on the value avoids hard-coding row positions that change by season.
AO_nba_advanced_stats <- AO_nba_advanced_stats[AO_nba_advanced_stats$Player != "Player", ]

# Preview the data without columns that are entirely NA (the result is not saved)
AO_nba_advanced_stats %>% 
  select(where(~!all(is.na(.))))

# Remove the two blank spacer columns (columns 20 and 25 of the scraped table;
# the second sits at position 24 once the first has been dropped)
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -20]
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -24]

# Convert the stat columns from <chr> to <dbl>
AO_nba_advanced_stats <- AO_nba_advanced_stats %>%
   mutate(across(G:VORP, as.numeric))

ASSIGNMENT 3: Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.

# Get NBA totals statistics for a specified year
get_nba_totals_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
  webpage <- read_html(url)
  totals_stats_table <- webpage %>%
    html_node("table#totals_stats") %>%
    html_table(fill = TRUE)

  # Clean up column names
  colnames(totals_stats_table) <- make.names(colnames(totals_stats_table), unique = TRUE)

  # Clean the data
  totals_stats_table <- totals_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Ensure no NA or duplicate header rows
  
  return(totals_stats_table)
}

# Get NBA advanced statistics for a specified year
get_nba_advanced_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Fetch webpage (GET() and user_agent() come from httr)
  webpage <- tryCatch({
    read_html(GET(url, user_agent("Mozilla/5.0")))
  }, error = function(e) {
    stop("Error fetching webpage: ", e$message)
  })
  
  # Extract table with updated ID
  advanced_stats_table <- webpage %>%
    html_node("table#advanced") %>%  # Updated selector to match the new ID
    html_table(fill = TRUE)
  
  # Clean up column names
  colnames(advanced_stats_table) <- make.names(colnames(advanced_stats_table), unique = TRUE)
  
  # Clean the data
  advanced_stats_table <- advanced_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Remove NA rows and duplicate headers
  
  return(advanced_stats_table)
}

ASSIGNMENT 4: Make a function with argument year that outputs one dataframe with the merged traditional and advanced data.

get_cleaned_nba_stats <- function(year) {
  # Fetch totals and advanced stats
  nba_totals <- get_nba_totals_stats(year)
  nba_advanced <- get_nba_advanced_stats(year)

  # Print to check datasets (Optional)
  print("NBA Totals:")
  print(head(nba_totals))
  print("NBA Advanced:")
  print(head(nba_advanced))

  # Clean Player names in the advanced dataset: remove the asterisk and trim spaces
  nba_advanced <- nba_advanced %>%
    mutate(Player = trimws(gsub("\\*", "", Player)))

  # Merge the datasets on Player, Pos, G, and MP
  nba_merge <- merge(nba_totals, nba_advanced, 
                     by = c("Player", "Pos", "G", "MP"), 
                     all.x = TRUE)

  # Debug: Show merged data sample
  print("Merged Data Sample:")
  print(head(nba_merge))
  
  # Check column names to confirm 'Team' exists
  print("Column Names Before Renaming:")
  print(colnames(nba_merge))

  # Clean 'Team' column: rename 'Tm' to 'Team' if present
  if ("Tm" %in% colnames(nba_merge)) {
    nba_merge <- nba_merge %>%
      rename(Team = Tm)
  }

  # Debug: Show column names after renaming
  print("Column Names After Renaming 'Tm' to 'Team':")
  print(colnames(nba_merge))

  # Handle duplicate columns like Team.x, Team.y, Awards.x, Awards.y
  duplicate_columns <- colnames(nba_merge)[grepl("\\.x$", colnames(nba_merge))]

  for (col in duplicate_columns) {
    # Extract the base name of the column (e.g., "Team" from "Team.x")
    base_col <- sub("\\.x$", "", col)
    y_col <- paste0(base_col, ".y")
    
    # Merge the .x and .y columns into one, then drop the old columns
    if (y_col %in% colnames(nba_merge)) {
      nba_merge <- nba_merge %>%
        mutate(!!base_col := coalesce(.data[[col]], .data[[y_col]])) %>%
        select(-all_of(c(col, y_col)))
    }
  }

  # Remove the 'X', 'X.1' columns if they exist
  columns_to_remove <- c("X", "X.1")
  nba_merge <- nba_merge %>%
    select(-any_of(columns_to_remove))

  # Merge 'Rk.x' and 'Rk.y' columns if both are still present
  if ("Rk.x" %in% names(nba_merge) && "Rk.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Rk = coalesce(as.character(Rk.x), as.character(Rk.y))) %>%
      select(-Rk.x, -Rk.y)
  }

  # Merge 'Age.x' and 'Age.y' columns if both are still present
  if ("Age.x" %in% names(nba_merge) && "Age.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Age = coalesce(as.numeric(Age.x), as.numeric(Age.y))) %>%
      select(-c(Age.x, Age.y))
  }

  # Reorder columns for clarity
  column_order <- c("Player", "Pos", "Age", "Rk", "G", "MP", "Team")
  nba_merge <- nba_merge %>%
    select(all_of(column_order), everything())

  # Return the cleaned and merged dataset
  return(nba_merge)
}

# Example usage
nba_data_2013 <- get_cleaned_nba_stats(2013)
nba_data_1984 <- get_cleaned_nba_stats(1984)

ASSIGNMENT 5: Make this file more visually appealing, with headers, bullet points, sections and subsections as you see fit. You may consider migrating over to Quarto for this reason.

<<<<<<< Updated upstream
---
title: "R Notebook"
output: html_notebook
---

The goal of this project is to investigate how partnerships involving multiple top-tier players in the NBA affect various performance measures and team outcomes. Among the research questions we would like to explore are the following:

*[ INSERT RESEARCH QUESTIONS HERE ]*

To investigate these questions, we need to pull data from multiple NBA seasons. The script below defines functions that pull traditional stats for every player in a user-specified season.
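
As a minimal sketch of how a season maps to a page, the helper below builds the per-game URL for a given year without fetching anything. The name `season_url()` and the lower bound of 1950 are illustrative assumptions, not part of the original script.

```r
# Hypothetical helper: build the Basketball-Reference per-game URL for a season.
season_url <- function(year, first_season = 1950) {
  # Reject anything that is not a single whole-number year in a plausible range
  # (first_season is an assumed lower bound, used for illustration only)
  this_year <- as.integer(format(Sys.Date(), "%Y"))
  stopifnot(
    is.numeric(year), length(year) == 1,
    year == round(year),
    year >= first_season, year <= this_year + 1
  )
  paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
}

# Build the URLs for three consecutive seasons
urls <- vapply(2016:2018, season_url, character(1))
print(urls)
```

Validating the year up front means a typo fails immediately with a clear message instead of producing a request for a page that does not exist.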

Load necessary libraries
```{r}
library(rvest)
library(dplyr)
library(tidyverse)
```

Function to get NBA roster for a specified year
```{r}
get_nba_roster <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that are repeated inside the table body)
  roster_table <- roster_table %>%
    filter(Player != "Player")

  return(roster_table)
}
```
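
The cleaning step inside `get_nba_roster()` can be illustrated offline. The sketch below uses a made-up three-row table (the player names and values are invented) in which the header row repeats inside the body, which is the situation the filter is meant to handle. It also shows why the stat columns come back as character until they are converted.

```r
# Toy stand-in for a scraped table: the header row repeats inside the body
raw <- data.frame(
  Player = c("Player A", "Player", "Player B"),
  PTS    = c("21.3", "PTS", "8.1"),
  stringsAsFactors = FALSE
)

# The stray "PTS" string forces the whole column to be character
print(class(raw$PTS))

# Drop the repeated header row, then convert the stat column to numeric
cleaned <- raw[raw$Player != "Player", ]
cleaned$PTS <- as.numeric(cleaned$PTS)
print(cleaned)
```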

Example usage
```{r}
year <- 2018  # Specify the year
nba_roster <- get_nba_roster(year)

# Print the first and last few rows of the roster
head(nba_roster)
tail(nba_roster)

```
```{r}
# Keep every player whose listed position is not point guard
position_roster <- filter(nba_roster, Pos != "PG")
position_roster
```



```{r}

library(plotly)

# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]

# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
               text = ~Player,  # This adds player names on hover
               hoverinfo = 'text', # Ensures that only player names appear on hover
               color = ~Pos,  # Colors points based on position
               marker = list(size = 10))

# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
                      xaxis = list(title = "Minutes Played", range = c(0, 48)),
                      yaxis = list(title = "Points", range = c(0, 35)))

# Show the plot
fig


```
```{r}
# Data visualization for minutes played vs. points scored
input <- nba_roster[, c('MP', 'PTS')]
print(head(input))

# Plot players with minutes played between 0 and 48
# and points between 0 and 35.
plot(x = input$MP, y = input$PTS,
	xlab = "Minutes Played",
	ylab = "Points",
	xlim = c(0.0, 48),
	ylim = c(0.0, 35),	 
	main = "Minutes Played vs Points Scored"
)

```


```{r}
# Data visualization for field goals attempted vs. field goals made
input_2 <- nba_roster[, c('FGA', 'FG')]
print(head(input_2))

# Use the observed maxima as the upper axis limits.
# as.numeric() guards against columns that were scraped as character.
b_FG <- max(as.numeric(input_2$FG), na.rm = TRUE)
b_FGA <- max(as.numeric(input_2$FGA), na.rm = TRUE)

plot(x = as.numeric(input_2$FGA), y = as.numeric(input_2$FG),
    xlab = "Field Goal Attempts",
    ylab = "Field Goals Made",
    xlim = c(0.0, b_FGA),
    ylim = c(0.0, b_FG),
    main = "Field Goal Attempts vs Field Goals Made"
)


```



```{r}
# Bar chart of the number of players listed at each position.
# geom_bar() counts rows, so the y-axis shows player counts rather than points.
ggplot(nba_roster, aes(x = Pos)) +
  geom_bar() +
  theme_classic(16) +
  xlab("Position") +
  ylab("Number of players")

```
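
If the intent is one bar per position showing points rather than a count of players, the points have to be summarized first. A minimal sketch with made-up numbers follows; the toy data and the name `pts_by_pos` are illustrative only.

```r
# Invented per-game scoring for four players
toy_roster <- data.frame(
  Pos = c("PG", "PG", "C", "SF"),
  PTS = c(20, 10, 12, 8)
)

# Average points per game within each position
pts_by_pos <- aggregate(PTS ~ Pos, data = toy_roster, FUN = mean)
print(pts_by_pos)
```

A summarized table like this could then be drawn with `geom_col()` from ggplot2, which plots the values as given instead of counting rows.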


**ASSIGNMENT 1:** *Is the data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*
       
The stat columns are read in as character even though they hold fractional values, so they should be converted from "character" to "double".


 - *b. justify the validity of your approach*
Observations with missing data are removed with `na.omit()`, which drops every row that contains a missing value.


 - *c. implement your proposed changes*


For players who are missing data
```{r}
nba_roster <- na.omit(nba_roster)

nba_roster

# Convert the stat columns (G through PTS) from character to double
nba_roster <- nba_roster %>%
   mutate(across(G:PTS, as.numeric))



```
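
Two details of this step are easy to miss: `mutate()` returns a new table, so the result has to be assigned, and blank strings become `NA` once converted. The sketch below shows both on an invented three-row table; the column names mirror the scraped data, but the values are made up.

```r
library(dplyr)

toy <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  G      = c("82", "41", "10"),
  PTS    = c("27.5", "", "3.2"),
  stringsAsFactors = FALSE
)

# Without the assignment, the converted table is only printed and `toy` is unchanged
converted <- toy %>%
  mutate(across(G:PTS, as.numeric))

# Count the missing values per column before deciding how to handle them
print(colSums(is.na(converted)))

# na.omit() then drops every row that has at least one NA
complete_rows <- na.omit(converted)
print(complete_rows)
```

Counting the `NA`s first is a cheap check on how many players `na.omit()` is about to remove.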

To determine whether a player is "top tier" and should be considered a part of a "Big 3" lineup, other authors have transformed traditional stats to create metrics such as

PRA = POINTS + REBOUNDS + ASSISTS 

We will also consider advanced statistics such as PLAYER EFFICIENCY RATING (PER). The formula below is a simplified per-game efficiency measure, not the full PER calculation:

PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO) / GP

In particular, Value Over Replacement Player (VORP) seems to do a solid job of identifying the best players in the league.
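
As a worked example of the two formulas above, the sketch below computes PRA and the simplified efficiency measure from season totals. The numbers are invented, and the column names (`TRB`, `TOV`, and so on) are assumed to follow the abbreviations used in the scraped tables. Missed shots are derived as attempts minus makes.

```r
# Invented season totals for two players
toy <- data.frame(
  Player = c("Player A", "Player B"),
  PTS = c(2000, 1200), TRB = c(600, 400), AST = c(500, 150),
  STL = c(100, 60),    BLK = c(50, 80),   TOV = c(250, 120),
  FG  = c(750, 450),   FGA = c(1500, 1000),
  FT  = c(400, 240),   FTA = c(500, 300),
  G   = c(80, 75)
)

# PRA = points + rebounds + assists
toy$PRA <- toy$PTS + toy$TRB + toy$AST

# Simplified efficiency per game, following the formula in the text
toy$EFF <- with(toy,
  (PTS + TRB + AST + STL + BLK - (FGA - FG) - (FTA - FT) - TOV) / G
)

print(toy[, c("Player", "PRA", "EFF")])
```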

The script below defines a function that pulls advanced stats for every player in a user-specified season.
```{r}

# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2018  # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats)


```

**ASSIGNMENT 2:** *Is the advanced data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*

 - *b. justify the validity of your approach*

 - *c. implement your proposed changes*
 
 The cleaning is similar to the approach used for the traditional stats.
 
 
 The script below cleans up the data quality issues present in the data frame.
 
 
 
```{r}
# Outline of the cleaning steps; `dataframe` stands in for the scraped advanced stats
dataframe <- nba_advanced_stats

#1 Order the rows alphabetically by player name so the filler header rows group together
dataframe <- dataframe[order(dataframe$Player), ]

#2 Remove the filler rows that repeat the column headers from the webpage
dataframe <- dataframe[dataframe$Player != "Player", ]

#3 Preview the data without columns that are entirely NA
dataframe %>% 
  select(where(~!all(is.na(.))))
```
 
 
 
```{r}

# Order the rows alphabetically by player name
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]

# Remove the filler rows that repeat the column headers from the webpage.
# Matching on the value avoids hard-coding row positions that change by season.
AO_nba_advanced_stats <- AO_nba_advanced_stats[AO_nba_advanced_stats$Player != "Player", ]

# Preview the data without columns that are entirely NA (the result is not saved)
AO_nba_advanced_stats %>% 
  select(where(~!all(is.na(.))))

# Remove the two blank spacer columns (columns 20 and 25 of the scraped table;
# the second sits at position 24 once the first has been dropped)
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -20]
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -24]

# Convert the stat columns from <chr> to <dbl>
AO_nba_advanced_stats <- AO_nba_advanced_stats %>%
   mutate(across(G:VORP, as.numeric))




```
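
Dropping columns by position works only while the page layout stays the same. One alternative, sketched below on invented data, is to detect spacer columns by their content, treating a column as blank when every value is `NA` or an empty string. The name `is_blank_col` and the toy table are assumptions for illustration.

```r
# Invented advanced-stats table with one blank spacer column
toy_adv <- data.frame(
  Player = c("Player A", "Player B"),
  PER    = c("25.1", "14.3"),
  blank1 = c("", NA),
  VORP   = c("6.0", "0.8"),
  stringsAsFactors = FALSE
)

# TRUE for columns where every entry is NA or ""
is_blank_col <- vapply(
  toy_adv,
  function(col) all(is.na(col) | col == ""),
  logical(1)
)

# Keep only the columns that carry data
cleaned_adv <- toy_adv[, !is_blank_col, drop = FALSE]
print(names(cleaned_adv))
```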


**ASSIGNMENT 3:** *Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.*



```{r}
# Merge the two cleaned tables into one data frame, keeping unmatched rows from both
nba_merge <- merge(nba_roster, AO_nba_advanced_stats,
                   by = c("Rk", "Player", "Pos", "Age", "Tm", "G"),
                   all = TRUE)


head(nba_merge)
```
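
`merge()` stops with an error when a column named in `by` is missing from either table, so it can help to check the keys first. The sketch below does that on two invented tables and then performs the same kind of full outer join. The key set is shortened to `Player` and `Tm` for brevity.

```r
# Invented traditional and advanced tables that share two key columns
trad <- data.frame(Player = c("A", "B", "C"), Tm = c("X", "Y", "Z"), PTS = c(20, 10, 5))
adv  <- data.frame(Player = c("A", "B", "D"), Tm = c("X", "Y", "W"), VORP = c(5.1, 1.2, 0.3))

keys <- c("Player", "Tm")

# Every key must exist in both tables before merging
missing_keys <- setdiff(keys, intersect(names(trad), names(adv)))
stopifnot(length(missing_keys) == 0)

# all = TRUE keeps players that appear in only one of the tables
merged <- merge(trad, adv, by = keys, all = TRUE)
print(merged)
```

Rows that appear in only one table come through with `NA` in the other table's columns, which makes unmatched players easy to spot.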


**ASSIGNMENT 4:** *Make a function with argument `year` that outputs one dataframe with the merged traditional and advanced data.* 

```{r}

combined_nba_stats <- function(year) {
  # Traditional (per-game) stats, scraped with get_nba_roster() defined above
  nba_roster2 <- get_nba_roster(year)

  # Drop rows with missing values, then convert the stat columns to double
  nba_roster2 <- na.omit(nba_roster2)
  nba_roster2 <- nba_roster2 %>%
    mutate(across(G:PTS, as.numeric))

  # Advanced stats, scraped with get_nba_advanced_stats() defined above
  nba_advanced_stats2 <- get_nba_advanced_stats(year)

  # Order the rows alphabetically by player name
  AO_nba_advanced_stats2 <- nba_advanced_stats2[order(nba_advanced_stats2$Player), ]

  # Remove the two blank spacer columns (columns 20 and 25 of the scraped table)
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[, -20]
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[, -24]

  # Remove the filler rows that repeat the column headers from the webpage
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[AO_nba_advanced_stats2$Player != "Player", ]

  # Convert the stat columns from <chr> to <dbl>
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2 %>%
    mutate(across(G:VORP, as.numeric))

  # Full outer join on the identifying columns shared by both tables
  nba_merge <- merge(nba_roster2, AO_nba_advanced_stats2,
                     by = c("Rk", "Player", "Pos", "Age", "Tm", "G"),
                     all = TRUE)

  return(nba_merge)
}

# Example usage
nba_merge_2023 <- combined_nba_stats(2023)
head(nba_merge_2023)



```
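
Because the project needs several seasons, a function that takes `year` can be applied over a range of years and the results stacked. The sketch below shows one way to do that. `collect_seasons()` and `fake_fetch()` are hypothetical names, and the stub stands in for the real scraper so the example runs without network access.

```r
# Hypothetical helper: call a per-season function for each year and stack the results
collect_seasons <- function(years, fetch_fun, pause = 0) {
  results <- lapply(years, function(yr) {
    season <- fetch_fun(yr)
    season$Season <- yr   # Record which season each row came from
    Sys.sleep(pause)      # Optional pause between requests to be polite to the site
    season
  })
  do.call(rbind, results)
}

# Stub standing in for a real scraping function (invented data)
fake_fetch <- function(year) {
  data.frame(Player = c("Player A", "Player B"), PTS = c(year - 2000, 1))
}

all_seasons <- collect_seasons(2017:2018, fake_fetch)
print(all_seasons)
```

With the real scraper, a non-zero `pause` would space out the requests instead of sending them back to back.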


**ASSIGNMENT 5:** *Make this file more visually appealing, with headers, bullet points, sections and subsections as you see fit. You may consider migrating over to Quarto for this reason.*



=======
---
title: "R Notebook"
output: html_notebook
---

The goal of this project is to investigate how partnerships involving multiple top-tier players in the NBA affect various performance measures and team outcomes. Among the research questions we would like to explore are the following:

*[ INSERT RESEARCH QUESTIONS HERE ]*

To investigate these questions, we need to pull data from multiple NBA seasons. The script below defines functions that pull traditional stats for every player in a user-specified season.

Load necessary libraries
```{r}
library(rvest)
library(dplyr)
library(tidyverse)
library(httr)

```

Function to get NBA roster for a specified year


Example usage





**ASSIGNMENT 1:** *Is the data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*
       
The stat columns are read in as character even though they hold fractional values, so they should be converted from "character" to "double".


 - *b. justify the validity of your approach*
Observations with missing data are removed with `na.omit()`, which drops every row that contains a missing value.


 - *c. implement your proposed changes*


**ASSIGNMENT 2:** *Is the advanced data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*

 - *b. justify the validity of your approach*

 - *c. implement your proposed changes*
 
 The cleaning is similar to the approach used for the traditional stats.
 
 
 The script below cleans up the data quality issues present in the data frame.
 
 
 
```{r}

```
 
 
 
```{r}

```


**ASSIGNMENT 3:** *Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.*



```{r}
#Get NBA Totals Statistics
get_nba_totals_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
  webpage <- read_html(url)
  totals_stats_table <- webpage %>%
    html_node("table#totals_stats") %>%
    html_table(fill = TRUE)

  # Clean up column names
  colnames(totals_stats_table) <- make.names(colnames(totals_stats_table), unique = TRUE)

  # Clean the data
  totals_stats_table <- totals_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Ensure no NA or duplicate header rows
  
  return(totals_stats_table)
}


get_nba_advanced_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Fetch webpage
  webpage <- tryCatch({
    read_html(GET(url, user_agent("Mozilla/5.0")))
  }, error = function(e) {
    stop("Error fetching webpage: ", e$message)
  })
  
  # Extract table with updated ID
  advanced_stats_table <- webpage %>%
    html_node("table#advanced") %>%  # Updated selector to match the new ID
    html_table(fill = TRUE)
  
  # Clean up column names
  colnames(advanced_stats_table) <- make.names(colnames(advanced_stats_table), unique = TRUE)
  
  # Clean the data
  advanced_stats_table <- advanced_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Remove NA rows and duplicate headers
  
  return(advanced_stats_table)
}


```
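
The `tryCatch()` call in `get_nba_advanced_stats()` catches a low-level error and re-raises it with a clearer prefix. The same pattern can be tried offline, as in the sketch below. `with_context()` is a hypothetical helper, and the failing function is a stand-in for a network request.

```r
# Hypothetical helper: run a function and prefix any error with a label
with_context <- function(action, label) {
  tryCatch(
    action(),
    error = function(e) {
      stop(label, ": ", conditionMessage(e), call. = FALSE)
    }
  )
}

# Stand-in for a request that fails
failing_request <- function() stop("boom")

# Capture the re-raised message instead of halting the script
message_seen <- tryCatch(
  with_context(failing_request, "Error fetching webpage"),
  error = function(e) conditionMessage(e)
)
print(message_seen)
```

Successful calls pass straight through: `with_context(function() 42, "label")` returns the function's value unchanged.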


**ASSIGNMENT 4:** *Make a function with argument `year` that outputs one dataframe with the merged traditional and advanced data.* 


Official Cleaning Function that works as of 10/29/2024
```{r}

get_cleaned_nba_stats <- function(year) {
  # Fetch totals and advanced stats
  nba_totals <- get_nba_totals_stats(year)
  nba_advanced <- get_nba_advanced_stats(year)

  # Print to check datasets (Optional)
  print("NBA Totals:")
  print(head(nba_totals))
  print("NBA Advanced:")
  print(head(nba_advanced))

  # Clean Player names in both datasets: remove any asterisk and trim spaces
  # so that the Player merge key matches across the two tables
  clean_player <- function(x) trimws(gsub("*", "", x, fixed = TRUE))
  nba_totals <- nba_totals %>%
    mutate(Player = clean_player(Player))
  nba_advanced <- nba_advanced %>%
    mutate(Player = clean_player(Player))

  # Merge the datasets on Player, Pos, G, and MP
  nba_merge <- merge(nba_totals, nba_advanced, 
                     by = c("Player", "Pos", "G", "MP"), 
                     all.x = TRUE)

  # Debug: Show merged data sample
  print("Merged Data Sample:")
  print(head(nba_merge))
  
  # Debug: Show column names before renaming
  print("Column Names Before Renaming:")
  print(colnames(nba_merge))

  # Clean 'Team' column: rename 'Tm' to 'Team' if present. When both tables
  # carry the column, the merge yields 'Tm.x' and 'Tm.y', so rename those too
  nba_merge <- nba_merge %>%
    rename_with(~ sub("^Tm", "Team", .x), .cols = matches("^Tm(\\.[xy])?$"))

  # Debug: Show column names after renaming
  print("Column Names After Renaming 'Tm' to 'Team':")
  print(colnames(nba_merge))

  # Handle duplicate columns like Team.x, Team.y, Awards.x, Awards.y
  duplicate_columns <- colnames(nba_merge)[grepl("\\.x$", colnames(nba_merge))]

  for (col in duplicate_columns) {
    # Extract the base name of the column (e.g., "Team" from "Team.x")
    base_col <- sub("\\.x$", "", col)
    
    # Merge the .x and .y columns into one
    y_col <- paste0(base_col, ".y")
    if (y_col %in% colnames(nba_merge)) {
      nba_merge <- nba_merge %>%
        mutate(!!base_col := coalesce(.data[[col]], .data[[y_col]])) %>%
        select(-all_of(c(col, y_col)))  # Drop the old columns
    }
  }

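  # Remove combined rows for players who changed teams mid-season
  # (team listed as "TOT", "2TM", "3TM", and so on; matched case-insensitively)
  nba_merge <- nba_merge %>%
    filter(!grepl("^(TOT|[0-9]TM)$", Team, ignore.case = TRUE))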

  # Remove players with multiple positions
  nba_merge <- nba_merge %>%
    filter(!grepl("-", Pos))

  # Remove the 'X', 'X.1' columns if they exist
  columns_to_remove <- c("X", "X.1")
  nba_merge <- nba_merge %>%
    select(-any_of(columns_to_remove))  # Remove specified columns if they exist

  # Merge 'Rk.x' and 'Rk.y' columns
  if ("Rk.x" %in% names(nba_merge) && "Rk.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Rk = coalesce(as.character(Rk.x), as.character(Rk.y))) %>%
      select(-Rk.x, -Rk.y)
  }

  # Merge 'Age.x' and 'Age.y' columns
  if ("Age.x" %in% names(nba_merge) && "Age.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Age = coalesce(as.numeric(Age.x), as.numeric(Age.y))) %>%
      select(-c(Age.x, Age.y))
  }

  # Reorder columns for clarity
  column_order <- c("Player", "Pos", "Age", "Rk", "G", "MP", "Team")
  nba_merge <- nba_merge %>%
    select(all_of(column_order), everything())

  # Return the cleaned and merged dataset
  return(nba_merge)
}

# Example usage
nba_data_2013 <- get_cleaned_nba_stats(2013)

# View the first few rows of the cleaned dataset
head(nba_data_2013)


```
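
The `.x`/`.y` handling inside `get_cleaned_nba_stats()` is easier to follow on a small example. The sketch below is only an illustration: `totals` and `advanced` are invented toy data frames (not scraped data), and it assumes `dplyr` is installed. It shows how `merge()` suffixes a shared non-key column and how `coalesce()` collapses each pair back into one column, preferring the `.x` value.

```r
library(dplyr)

# Toy stand-ins for the scraped tables (all values are invented)
totals <- data.frame(
  Player = c("A. Guard", "B. Center"),
  Pos    = c("PG", "C"),
  G      = c("70", "65"),
  MP     = c("2100", "1800"),
  Team   = c("BOS", NA),
  stringsAsFactors = FALSE
)
advanced <- data.frame(
  Player = c("A. Guard", "B. Center"),
  Pos    = c("PG", "C"),
  G      = c("70", "65"),
  MP     = c("2100", "1800"),
  Team   = c("BOS", "LAL"),
  stringsAsFactors = FALSE
)

# merge() suffixes the shared non-key column: Team.x (totals), Team.y (advanced)
merged <- merge(totals, advanced,
                by = c("Player", "Pos", "G", "MP"),
                all.x = TRUE)

# Collapse every .x/.y pair into a single column, preferring the .x value
for (x_col in grep("\\.x$", names(merged), value = TRUE)) {
  base_col <- sub("\\.x$", "", x_col)
  y_col <- paste0(base_col, ".y")
  if (y_col %in% names(merged)) {
    merged <- merged %>%
      mutate(!!base_col := coalesce(.data[[x_col]], .data[[y_col]])) %>%
      select(-all_of(c(x_col, y_col)))
  }
}

merged
```

Note that `coalesce()` requires both columns of a pair to have compatible types, which is why the toy columns are all character here.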


```{r}
nba_data_2022 <- get_cleaned_nba_stats(2022)
nba_data_2010 <- get_cleaned_nba_stats(2010)
nba_data_2015 <- get_cleaned_nba_stats(2015)
nba_data_2011 <- get_cleaned_nba_stats(2011)
nba_data_2012 <- get_cleaned_nba_stats(2012)
nba_data_2009 <- get_cleaned_nba_stats(2009)
nba_data_2008 <- get_cleaned_nba_stats(2008)
nba_data_2007 <- get_cleaned_nba_stats(2007)
nba_data_2006 <- get_cleaned_nba_stats(2006)
nba_data_2005 <- get_cleaned_nba_stats(2005)
nba_data_2004 <- get_cleaned_nba_stats(2004)
nba_data_2003 <- get_cleaned_nba_stats(2003)
nba_data_2002 <- get_cleaned_nba_stats(2002)
nba_data_2001 <- get_cleaned_nba_stats(2001)
nba_data_2000 <- get_cleaned_nba_stats(2000)

nba_data_1999 <- get_cleaned_nba_stats(1999)
nba_data_1998 <- get_cleaned_nba_stats(1998)
nba_data_1997 <- get_cleaned_nba_stats(1997)
nba_data_1996 <- get_cleaned_nba_stats(1996)
nba_data_1995 <- get_cleaned_nba_stats(1995)
nba_data_1994 <- get_cleaned_nba_stats(1994)
nba_data_1993 <- get_cleaned_nba_stats(1993)
nba_data_1992 <- get_cleaned_nba_stats(1992)
nba_data_1991 <- get_cleaned_nba_stats(1991)
nba_data_1990 <- get_cleaned_nba_stats(1990)
nba_data_1989 <- get_cleaned_nba_stats(1989)
nba_data_1988 <- get_cleaned_nba_stats(1988)
nba_data_1987 <- get_cleaned_nba_stats(1987)
nba_data_1986 <- get_cleaned_nba_stats(1986)
nba_data_1985 <- get_cleaned_nba_stats(1985)
nba_data_1984 <- get_cleaned_nba_stats(1984)
nba_data_1983 <- get_cleaned_nba_stats(1983)
nba_data_1982 <- get_cleaned_nba_stats(1982)
nba_data_1981 <- get_cleaned_nba_stats(1981)
nba_data_1980 <- get_cleaned_nba_stats(1980)

# Read the first line of a team page (quick check that the site is reachable)
readLines("https://www.basketball-reference.com/teams/LAL/2023.html", n = 1)

```
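
The season-by-season calls above could also be driven by a loop. The following is a minimal sketch, not part of the original workflow: `fetch_seasons()` and `stub_fetch()` are hypothetical names, the stub stands in for `get_cleaned_nba_stats()` so the example runs without network access, and the `pause` default is an arbitrary placeholder rather than a documented rate limit.

```r
# Hypothetical helper: call `fetch_fun` once per season and collect the
# results in a named list, keeping a NULL slot for any season that fails
fetch_seasons <- function(years, fetch_fun, pause = 3) {
  results <- vector("list", length(years))
  names(results) <- paste0("nba_data_", years)
  for (i in seq_along(years)) {
    value <- tryCatch(
      fetch_fun(years[i]),
      error = function(e) {
        message("Season ", years[i], " failed: ", conditionMessage(e))
        NULL
      }
    )
    # Assign via list() so that a NULL result keeps its slot in the list
    results[i] <- list(value)
    # Wait between requests, but not after the last one
    if (i < length(years)) Sys.sleep(pause)
  }
  results
}

# Stub used only for this demonstration; it fails for one season on purpose
stub_fetch <- function(year) {
  if (year == 1981) stop("simulated failure")
  data.frame(Player = "Example Player", Season = year)
}

seasons <- fetch_seasons(1980:1982, stub_fetch, pause = 0)
names(seasons)
```

With the real scraper, the call would presumably look like `fetch_seasons(1980:2015, get_cleaned_nba_stats)`, and a single season can then be pulled out with `seasons[["nba_data_2013"]]`.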


**ASSIGNMENT 5:** *Make this file more visually appealing, with headers, bullet points, sections, and subsections as you see fit. You may consider migrating to Quarto for this reason.*


File locator
```{r}
# Save each data frame as a CSV file in the current working directory
write.csv(nba_roster2, file = "generalstats.csv", row.names = FALSE)
write.csv(AO_nba_advanced_stats2, file = "advancedstats.csv", row.names = FALSE)
write.csv(nba_data_2023, file = "nba2023.csv", row.names = FALSE)
write.csv(nba_data_2013, file = "nba2013.csv", row.names = FALSE)

# Print the working directory to locate the files written above
getwd()

```
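
If several data frames are kept in a named list, the CSV export can be written once rather than once per object. This is a minimal sketch under stated assumptions: `write_frames_to_csv()` is a hypothetical helper, the demo data frame is invented, and the files go to a temporary directory so the project folder is left untouched.

```r
# Hypothetical helper: write each data frame in a named list to
# "<name>.csv" inside `dir` and return the file paths
write_frames_to_csv <- function(frames, dir = getwd()) {
  paths <- file.path(dir, paste0(names(frames), ".csv"))
  for (i in seq_along(frames)) {
    write.csv(frames[[i]], file = paths[i], row.names = FALSE)
  }
  paths
}

# Invented demo data, used only to exercise the helper
demo <- list(
  nba_demo = data.frame(
    Player = c("Example One", "Example Two"),
    G      = c(10, 20)
  )
)

out_dir <- tempdir()
written <- write_frames_to_csv(demo, dir = out_dir)

# Read the file back to confirm the round trip
read_back <- read.csv(written[1])
read_back
```

Returning the paths plays the same role as the `getwd()` call above: it shows where the files ended up.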






>>>>>>> Stashed changes